Abstract

Path queries on a knowledge graph can be used to answer compositional questions such as “What languages are spoken by people living in Lisbon?”. However, knowledge graphs often have missing facts (edges), which disrupts path queries. Recent models for knowledge base completion impute missing facts by embedding knowledge graphs in vector spaces. We show that these models can be recursively applied to answer path queries, but that they suffer from cascading errors. This motivates a new “compositional” training objective, which dramatically improves all models’ ability to answer path queries, in some cases more than doubling accuracy.

On a standard knowledge base completion task, we also demonstrate that compositional training acts as a novel form of structural regularization, reliably improving performance across all base models (reducing errors by up to 43%) and achieving new state-of-the-art results.

1 Introduction 

Broad-coverage knowledge bases such as Freebase (Bollacker et al., 2008) support a rich array of reasoning and question answering applications, but they are known to suffer from incomplete coverage (Min et al., 2013). For example, as of May 2015, Freebase has an entity Tad Lincoln (Abraham Lincoln’s son), but does not have his ethnicity. An elegant solution to incompleteness is using vector space representations: controlling the dimensionality of the vector space forces generalization to new facts (Nickel et al., 2011; Nickel et al., 2012; Socher et al., 2013; Riedel et al., 2013; Neelakantan et al., 2015). In the example, we would hope to infer Tad’s ethnicity from the ethnicity of his parents.



Figure 1: We propose performing path queries such as tad_lincoln/parents/location (“Where are Tad Lincoln’s parents located?”) in a parallel low-dimensional vector space. Here, entity sets (boxed) are represented as real vectors, and edge traversal is driven by vector-to-vector transformations (e.g., matrix multiplication).

However, what is missing from these vector space models is the original strength of knowledge bases: the ability to support compositional queries (Ullman, 1985). For example, we might ask what the ethnicity of Abraham Lincoln’s daughter would be. This can be formulated as a path query on the knowledge graph, and we would like a method that can answer this efficiently, while generalizing over missing facts and even missing or hypothetical entities (Abraham Lincoln did not in fact have a daughter).

In this paper, we present a scheme to answer path queries on knowledge bases by “compositionalizing” a broad class of vector space models that have been used for knowledge base completion (see Figure 1). At a high level, we interpret the base vector space model as implementing a soft edge traversal operator. This operator can then be recursively applied to predict paths. Our interpretation suggests a new compositional training objective that encourages better modeling of paths. Our technique is applicable to a broad class of composable models that includes the bilinear model (Nickel et al., 2011) and TransE (Bordes et al., 2013).

We have two key empirical findings. First, we show that compositional training enables us to answer path queries up to at least length 5 by substantially reducing cascading errors present in the base vector space model. Second, we find that, somewhat surprisingly, compositional training also improves upon state-of-the-art performance for knowledge base completion, which is a special case of answering unit length path queries. Therefore, compositional training can also be seen as a new form of structural regularization for existing models.

2 Task 

We now give a formal definition of the task of answering path queries on a knowledge base. Let $\mathcal{E}$ be a set of entities and $\mathcal{R}$ be a set of binary relations. A knowledge graph $\mathcal{G}$ is defined as a set of triples of the form $(s, r, t)$ where $s, t \in \mathcal{E}$ and $r \in \mathcal{R}$. An example of a triple in Freebase is

(tad_lincoln, parents, abraham_lincoln).

A path query $q$ consists of an initial anchor entity, $s$, followed by a sequence of relations to be traversed, $p = (r_1, \ldots, r_k)$. The answer or denotation of the query, $\llbracket q \rrbracket$, is the set of all entities that can be reached from $s$ by traversing $p$. Formally, this can be defined recursively:

$$\llbracket s \rrbracket = \{s\}, \tag{1}$$

$$\llbracket q/r \rrbracket = \{t : \exists s \in \llbracket q \rrbracket, (s, r, t) \in \mathcal{G}\}. \tag{2}$$

For example, tad_lincoln/parents/location is a query $q$ that asks: “Where did Tad Lincoln’s parents live?”.
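The recursion in (1) and (2) can be sketched directly as a set computation. The following is a minimal illustration, not the paper's implementation: the function name `denotation` and the toy `location` edges are assumptions made for the example.

```python
def denotation(graph, source, path):
    """Return the set of entities reached from `source` by following `path`.

    `graph` is a set of (s, r, t) triples; `path` is a list of relation names.
    """
    current = {source}                      # Eq. (1): the denotation of s is {s}
    for relation in path:                   # Eq. (2), applied once per relation
        current = {t for (s, r, t) in graph
                   if s in current and r == relation}
    return current


# Toy graph for illustration only; the location edges are assumed, not from Freebase.
graph = {
    ("tad_lincoln", "parents", "abraham_lincoln"),
    ("tad_lincoln", "parents", "mary_todd_lincoln"),
    ("abraham_lincoln", "location", "springfield"),
    ("mary_todd_lincoln", "location", "springfield"),
    ("abraham_lincoln", "location", "washington_dc"),
}

answers = denotation(graph, "tad_lincoln", ["parents", "location"])
print(sorted(answers))
```

An empty path returns the anchor entity itself, matching the base case (1).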

For evaluation (see Section 5 for details), we define the set of candidate answers to a query $\mathcal{C}(q)$ as the set of all entities that “type match”, namely those that participate in the final relation of $q$ at least once; and let $\mathcal{N}(q)$ be the incorrect answers:

$$\mathcal{C}(s/r_1/\cdots/r_k) = \{t \mid \exists e, (e, r_k, t) \in \mathcal{G}\} \tag{3}$$

$$\mathcal{N}(q) = \mathcal{C}(q) \setminus \llbracket q \rrbracket. \tag{4}$$

Knowledge base completion. Knowledge base completion (KBC) is the task of predicting whether a given edge $(s, r, t)$ belongs in the graph or not. This can be formulated as a path query $q = s/r$ with candidate answer $t$.


3 Compositionalization 

In this section, we show how to compositionalize existing KBC models to answer path queries. We start with a motivating example in Section 3.1, then present the general technique in Section 3.2. This suggests a new compositional training objective, described in Section 3.3. Finally, we illustrate the technique for several more models in Section 3.4, which we use in our experiments.

3.1 Example 

A common vector space model for knowledge base completion is the bilinear model (Nickel et al., 2011). In this model, we learn a vector $x_e \in \mathbb{R}^d$ for each entity $e \in \mathcal{E}$ and a matrix $W_r \in \mathbb{R}^{d \times d}$ for each relation $r \in \mathcal{R}$. Given a query $s/r$ (asking for the set of entities connected to $s$ via relation $r$), the bilinear model scores how likely $t \in \llbracket s/r \rrbracket$ holds using

$$\mathrm{score}(s/r, t) = x_s^\top W_r x_t. \tag{5}$$

To motivate our compositionalization technique, take $d = |\mathcal{E}|$ and suppose $W_r$ is the adjacency matrix for relation $r$ and entity vector $x_e$ is the indicator vector with a 1 in the entry corresponding to entity $e$. Then, to answer a path query $q = s/r_1/\ldots/r_k$, we would compute

$$\mathrm{score}(q, t) = x_s^\top W_{r_1} \cdots W_{r_k} x_t. \tag{6}$$

It is easy to verify that the score counts the number of unique paths between $s$ and $t$ following relations $r_1/\ldots/r_k$. Hence, any $t$ with positive score is a correct answer ($\llbracket q \rrbracket = \{t : \mathrm{score}(q, t) > 0\}$).

Let us interpret (6) recursively. The model begins with an entity vector $x_s$, and sequentially applies traversal operators $T_{r_i}(v) = v^\top W_{r_i}$ for each $r_i$. Each traversal operation results in a new “set vector” representing the entities reached at that point in traversal (corresponding to the nonzero entries of the set vector). Finally, it applies the membership operator $M(v, x_t) = v^\top x_t$ to check if $t \in \llbracket s/r_1/\ldots/r_k \rrbracket$. Writing graph traversal in this way immediately suggests a useful generalization: take $d$ much smaller than $|\mathcal{E}|$ and learn the parameters $W_r$ and $x_e$.
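To make the path-counting claim concrete, here is a small sketch of (6) with exact adjacency matrices and indicator vectors. The four-entity graph and all helper names (`vec_mat`, `one_hot`, `path_score`) are assumptions for illustration; plain Python lists stand in for vectors and matrices.

```python
def vec_mat(v, m):
    """Row vector times matrix, i.e. the traversal step v^T W."""
    return [sum(v[i] * m[i][j] for i in range(len(v)))
            for j in range(len(m[0]))]


def dot(u, v):
    """Inner product, i.e. the membership step v^T x_t."""
    return sum(a * b for a, b in zip(u, v))


def one_hot(index, size):
    """Indicator vector for the entity with the given index."""
    return [1 if i == index else 0 for i in range(size)]


# Entities: 0 = s, 1 = a, 2 = b, 3 = t
n = 4
# Adjacency matrix of relation r1: s -> a and s -> b
W_r1 = [[0, 1, 1, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
# Adjacency matrix of relation r2: a -> t and b -> t
W_r2 = [[0, 0, 0, 0],
        [0, 0, 0, 1],
        [0, 0, 0, 1],
        [0, 0, 0, 0]]


def path_score(source, matrices, target):
    """Evaluate Eq. (6) by applying one traversal operator per relation."""
    v = one_hot(source, n)
    for W in matrices:
        v = vec_mat(v, W)          # set vector after this traversal
    return dot(v, one_hot(target, n))


print(path_score(0, [W_r1, W_r2], 3))  # 2: the paths s->a->t and s->b->t
```

The intermediate vector after the first traversal is `[0, 1, 1, 0]`, whose nonzero entries are exactly the entities reached so far.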

3.2 General technique 

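The strategy used to extend the bilinear model of (5) to the compositional model in (6) can be applied to any composable model: namely, one that has a scoring function of the form:

$$\mathrm{score}(s/r, t) = M(T_r(x_s), x_t) \tag{7}$$

for some choice of membership operator $M : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ and traversal operator $T_r : \mathbb{R}^d \to \mathbb{R}^d$.

We can now define the vector denotation of a query $\llbracket q \rrbracket_V$ analogous to the definition of $\llbracket q \rrbracket$ in (1) and (2):

$$\llbracket s \rrbracket_V = x_s, \tag{8}$$

$$\llbracket q/r \rrbracket_V = T_r(\llbracket q \rrbracket_V). \tag{9}$$

The score function for a compositionalized model is then

$$\mathrm{score}(q, t) = M(\llbracket q \rrbracket_V, \llbracket t \rrbracket_V). \tag{10}$$

We would like $\llbracket q \rrbracket_V$ to approximately represent the set $\llbracket q \rrbracket$ in the sense that for every $e \in \llbracket q \rrbracket$, $M(\llbracket q \rrbracket_V, \llbracket e \rrbracket_V)$ is larger than the values for $e \notin \llbracket q \rrbracket$. Of course it is not possible to represent all sets perfectly, but in the next section, we present a training objective that explicitly optimizes $T$ and $M$ to preserve path information.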

3.3 Compositional training 

The score function in (10) naturally suggests a new compositional training objective. Let $\{(q_i, t_i)\}_{i=1}^{N}$ denote a set of path query training examples with path lengths ranging from 1 to $L$. We minimize the following max-margin objective:

$$J(\Theta) = \sum_{i=1}^{N} \sum_{t' \in \mathcal{N}(q_i)} \left[1 - \mathrm{margin}(q_i, t_i, t')\right]_+,$$

$$\mathrm{margin}(q, t, t') = \mathrm{score}(q, t) - \mathrm{score}(q, t'),$$

where the parameters are the membership operator, the traversal operators, and the entity vectors:

$$\Theta = \{M\} \cup \{T_r : r \in \mathcal{R}\} \cup \{x_e \in \mathbb{R}^d : e \in \mathcal{E}\}.$$

This objective encourages the construction of “set vectors”: because there are path queries of different lengths and types, the model must learn to produce an accurate set vector $\llbracket q \rrbracket_V$ after any sequence of traversals. Another perspective is that each traversal operator is trained such that its transformation preserves information in the set vector which might be needed in subsequent traversal steps.
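As a rough illustration of the max-margin objective, the sketch below evaluates the hinge loss for a fixed table of scores. It only computes the value of the objective; it does not perform any optimization. The names `hinge_objective`, `toy_scores`, and `toy_score` are hypothetical, and the scores are hand-picked rather than produced by a trained model.

```python
def hinge_objective(examples, negatives, score):
    """Sum of [1 - margin]_+ over every (query, answer, negative) combination.

    `examples` is a list of (query, correct_answer) pairs, `negatives` maps a
    query to its incorrect candidates, and `score(q, t)` is any scoring function.
    """
    total = 0.0
    for query, answer in examples:
        for wrong in negatives[query]:
            margin = score(query, answer) - score(query, wrong)
            total += max(0.0, 1.0 - margin)   # zero once the margin reaches 1
    return total


# Hand-picked scores for one query with one correct and two incorrect answers.
toy_scores = {("q1", "a"): 2.0, ("q1", "b"): 1.5, ("q1", "c"): -1.0}


def toy_score(query, entity):
    return toy_scores[(query, entity)]


loss = hinge_objective([("q1", "a")], {"q1": ["b", "c"]}, toy_score)
print(loss)  # 0.5: only negative "b" violates the margin
```

Negative `c` already trails the correct answer by more than 1, so it contributes nothing; this is the saturation behavior that Section 6.1 relates to cascading errors.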

In contrast, previously proposed training objectives for knowledge base completion only train on queries of path length 1. We will refer to this special case as single-edge training.

In Section 5, we show that compositional training leads to substantially better results for both path query answering and knowledge base completion. In Section 6, we provide insight into why.

3.4 Other composable models 

There are many possible candidates for $T$ and $M$. For example, $T$ could be one’s favorite neural network mapping from $\mathbb{R}^d$ to $\mathbb{R}^d$. Here, we focus on two composable models that were both recently shown to achieve state-of-the-art performance on knowledge base completion.

TransE. The TransE model of Bordes et al. (2013) uses the scoring function

$$\mathrm{score}(s/r, t) = -\|x_s + w_r - x_t\|_2^2 \tag{11}$$

where $x_s$, $w_r$ and $x_t$ are all $d$-dimensional vectors.

In this case, the model can be expressed using membership operator

$$M(v, x_t) = -\|v - x_t\|_2^2 \tag{12}$$

and traversal operator $T_r(x_s) = x_s + w_r$. Hence, TransE can handle a path query $q = s/r_1/r_2/\cdots/r_k$ using

$$\mathrm{score}(q, t) = -\|x_s + w_{r_1} + \cdots + w_{r_k} - x_t\|_2^2.$$

We visualize the compositional TransE model in Figure 2.

Bilinear-Diag. The Bilinear-Diag model of Yang et al. (2015) is a special case of the bilinear model with the relation matrices constrained to be diagonal. Alternatively, the model can be viewed as a variant of TransE with multiplicative interactions between entity and relation vectors.
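Both models fit the template of (7) to (10): a path score is obtained by folding a traversal operator over the relations and applying a membership operator at the end. The sketch below illustrates this with hand-picked 2-dimensional parameters; the vectors, the dictionaries `w_transe` and `w_diag`, and all function names are assumptions for the example, not learned values or the authors' code.

```python
def compositional_score(x_source, relations, x_target, traverse, membership):
    """Score a path query by folding the traversal operator over the path."""
    v = x_source                       # Eq. (8): start from the anchor vector
    for r in relations:
        v = traverse(r, v)             # Eq. (9): one soft edge traversal
    return membership(v, x_target)     # Eq. (10): compare with the candidate


# Hypothetical relation parameters, chosen by hand (not learned).
w_transe = {"parents": [1.0, 0.0], "location": [0.0, 2.0]}
w_diag = {"parents": [2.0, 1.0], "location": [1.0, 3.0]}


# TransE: additive traversal, negative squared distance as membership.
def transe_traverse(r, v):
    return [a + b for a, b in zip(v, w_transe[r])]


def transe_membership(v, x_t):
    return -sum((a - b) ** 2 for a, b in zip(v, x_t))


# Bilinear-Diag: elementwise scaling (a diagonal matrix), dot-product membership.
def diag_traverse(r, v):
    return [a * b for a, b in zip(v, w_diag[r])]


def diag_membership(v, x_t):
    return sum(a * b for a, b in zip(v, x_t))


path = ["parents", "location"]
x_s = [0.0, 0.0]
x_near = [1.0, 2.0]    # exactly x_s + w_parents + w_location
x_far = [3.0, 2.0]

transe_near = compositional_score(x_s, path, x_near, transe_traverse, transe_membership)
transe_far = compositional_score(x_s, path, x_far, transe_traverse, transe_membership)
diag_score = compositional_score([1.0, 1.0], path, [1.0, 1.0], diag_traverse, diag_membership)
```

Under TransE the candidate that sits exactly at the translated point scores highest (zero), while a candidate two units away scores -4. Swapping in a different `traverse` and `membership` pair changes the model without touching the recursion.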

Not all models can be compositionalized. It is important to point out that some models are not naturally composable: for example, the latent feature model of Riedel et al. (2013) and the neural tensor network of Socher et al. (2013). These approaches have scoring functions which combine $s$, $r$ and $t$ in a way that does not involve an intermediate vector representing $s/r$ alone without $t$, so they do not decompose according to (7).




                  WordNet      Freebase
Relations              11            13
Entities           38,696        75,043
Base    Train     112,581       316,232
        Test       10,544        23,733
Paths   Train   2,129,539     6,266,058
        Test       46,577       109,557

Table 1: WordNet and Freebase statistics for base and path query datasets.

3.5 Implementation 

We use AdaGrad (Duchi et al., 2010) to optimize $J(\Theta)$, which is in general non-convex. Initialization scale, mini-batch size and step size were cross-validated for all models. We initialize all parameters with i.i.d. Gaussians of variance 0.1 in every entry, use a mini-batch size of 300 examples, and a step size in [0.001, 0.1] (chosen via cross-validation) for all of the models. For each example $q$, we sample 10 negative entities $t' \in \mathcal{N}(q)$. During training, all of the entity vectors are constrained to lie on the unit ball, and we clipped the gradients to the median of the observed gradients if the update exceeded 3 times the median.

We first train on path queries of length 1 until convergence and then train on all path queries until convergence. This guarantees that the model masters basic edges before composing them to form paths. When training on path queries, we explicitly parameterize inverse relations. For the bilinear model, we initialize $W_{r^{-1}}$ with $W_r^\top$. For TransE, we initialize $w_{r^{-1}}$ with $-w_r$. For Bilinear-Diag, we found initializing $w_{r^{-1}}$ with the exact inverse $1/w_r$ is numerically unstable, so we instead randomly initialize $w_{r^{-1}}$ with i.i.d. Gaussians of variance 0.1 in every entry. Additionally, for the bilinear model, we replaced the sum over $\mathcal{N}(q_i)$ in the objective with a max since it yielded slightly higher accuracy. Our models are implemented using Theano (Bastien et al., 2012; Bergstra et al., 2010).

4 Datasets 

In Section 4.1, we describe two standard knowledge base completion datasets. These consist of single-edge queries, so we call them base datasets. In Section 4.2, we generate path query datasets from these base datasets.


4.1 Base datasets 

Our experiments are conducted using the subsets of WordNet and Freebase from Socher et al. (2013). The statistics of these datasets and their splits are given in Table 1.

The WordNet and Freebase subsets exhibit substantial differences that can influence model performance. The Freebase subset is almost bipartite with most of the edges taking the form $(s, r, t)$ for some person $s$, relation $r$ and property $t$. In WordNet, both the source and target entities are arbitrary words.

Both the raw WordNet and Freebase contain many relations that are almost perfectly correlated with an inverse relation. For example, WordNet contains both has_part and part_of, and Freebase contains both parents and children. At test time, a query on an edge $(s, r, t)$ is easy to answer if the inverse triple $(t, r^{-1}, s)$ was observed in the training set. Following Socher et al. (2013), we account for this by excluding such “trivial” queries from the test set.

4.2 Path query datasets 

Given a base knowledge graph, we generate path queries by performing random walks on the graph. If we view compositional training as a form of regularization, this approach allows us to generate extremely large amounts of auxiliary training data. The procedure is given below.

Let $\mathcal{G}_{\text{train}}$ be the training graph, which consists only of the edges in the training set of the base dataset. We then repeatedly generate training examples with the following procedure:

1. Uniformly sample a path length $L \in \{1, \ldots, L_{\max}\}$, and uniformly sample a starting entity $s \in \mathcal{E}$.

2. Perform a random walk beginning at entity $s$ and continuing $L$ steps.

(a) At step $i$ of the walk, choose a relation $r_i$ uniformly from the set of relations incident on the current entity $e$.

(b) Choose the next entity uniformly from the set of entities reachable via $r_i$.

3. Output a query-answer pair, $(q, t)$, where $q = s/r_1/\cdots/r_L$ and $t$ is the final entity of the random walk.
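The sampling procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' generator: walks that hit an entity with no outgoing edges are simply discarded (the text does not say how dead ends are handled), inverse relations are not added, and the names `build_index` and `sample_path_query` as well as the toy triples are hypothetical.

```python
import random
from collections import defaultdict


def build_index(triples):
    """Map entity -> relation -> list of entities reachable via that relation."""
    index = defaultdict(lambda: defaultdict(list))
    for s, r, t in triples:
        index[s][r].append(t)
    return index


def sample_path_query(index, entities, max_length, rng):
    """Return (source, relations, target) from one random walk, or None."""
    length = rng.randint(1, max_length)            # step 1: L uniform in {1..L_max}
    source = rng.choice(entities)                  # step 1: uniform start entity
    current, relations = source, []
    for _ in range(length):                        # step 2: walk L steps
        outgoing = index.get(current)
        if not outgoing:
            return None                            # assumption: drop dead-end walks
        relation = rng.choice(sorted(outgoing))    # step 2(a): uniform relation
        current = rng.choice(outgoing[relation])   # step 2(b): uniform next entity
        relations.append(relation)
    return source, relations, current              # step 3: query-answer pair


# Toy training graph (a set of triples, so reachable entities are not repeated).
triples = {
    ("tad_lincoln", "parents", "abraham_lincoln"),
    ("tad_lincoln", "parents", "mary_todd_lincoln"),
    ("abraham_lincoln", "location", "springfield"),
    ("mary_todd_lincoln", "location", "springfield"),
}
index = build_index(triples)
entities = sorted({e for s, _, t in triples for e in (s, t)})

rng = random.Random(0)
samples = [sample_path_query(index, entities, 2, rng) for _ in range(50)]
queries = [s for s in samples if s is not None]
```

Each returned tuple corresponds to a query such as tad_lincoln/parents/location together with one sampled answer.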




In practice, we do not sample paths of length 1 and instead directly add all of the edges from $\mathcal{G}_{\text{train}}$ to the path query dataset.

To generate a path query test set, we repeat the above procedure except using the graph $\mathcal{G}_{\text{full}}$, which is $\mathcal{G}_{\text{train}}$ plus all of the test edges from the base dataset. Then we remove any queries from the test set that also appeared in the training set. The statistics for the path query datasets are presented in Table 1.

5 Main results 

We evaluate the models derived in Section 3 on two tasks: path query answering and knowledge base completion. On both tasks, we show that the compositional training strategy proposed in Section 3.3 leads to substantial performance gains over standard single-edge training. We also compare directly against the KBC results of Socher et al. (2013), demonstrating that previously inferior models now match or outperform state-of-the-art models after compositional training.

Evaluation metric. Numerous metrics have been used to evaluate knowledge base queries, including hits at 10 (percentage of correct answers ranked in the top 10) and mean rank. We evaluate on hits at 10, as well as a normalized version of mean rank, mean quantile, which better accounts for the total number of candidates. For a query $q$, the quantile of a correct answer $t$ is the fraction of incorrect answers ranked after $t$:

$$\frac{|\{t' \in \mathcal{N}(q) : \mathrm{score}(q, t') < \mathrm{score}(q, t)\}|}{|\mathcal{N}(q)|}$$

The quantile ranges from 0 to 1, with 1 being optimal. Mean quantile is then defined to be the average quantile score over all examples in the dataset. To illustrate why normalization is important, consider a set of queries on the relation gender. A model that predicts the incorrect gender on every query would receive a mean rank of 2 (since there are only 2 candidate answers), which is fairly good in absolute terms, whereas the mean quantile would be 0, rightfully penalizing the model.
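The quantile is simple to compute once the scores of the correct answer and of the incorrect candidates are known. The sketch below reproduces the gender example; the function name `quantile` and the score values are assumptions for illustration.

```python
def quantile(scores, correct, incorrect):
    """Fraction of incorrect answers scored strictly below the correct answer."""
    beaten = sum(1 for t in incorrect if scores[t] < scores[correct])
    return beaten / len(incorrect)


# Gender example: two candidates, and the model prefers the wrong one.
gender_scores = {"male": 0.9, "female": 0.1}
print(quantile(gender_scores, correct="female", incorrect=["male"]))  # 0.0

# A query with three incorrect candidates, two of which rank below the answer.
other_scores = {"a": 3.0, "b": 2.0, "c": 1.0, "d": 0.0}
q_b = quantile(other_scores, correct="b", incorrect=["a", "c", "d"])
```

In the gender example the correct answer has rank 2, which looks acceptable as a raw rank, yet the quantile is 0 because the only incorrect candidate outranks it. The function divides by the number of incorrect candidates, so it is undefined when that set is empty.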

As a final note, several of the queries in the Freebase path dataset are “type-match trivial” in the sense that all of the type matching candidates $\mathcal{C}(q)$ are correct answers to the query. In this case, mean quantile is undefined and we exclude such queries from evaluation.


Overview. The upper half of Table 2 shows that compositional training improves path querying performance across all models and metrics on both datasets, reducing error by up to 76.2%.

The lower half of Table 2 shows that, surprisingly, compositional training also improves performance on knowledge base completion across almost all models, metrics and datasets. On WordNet, TransE benefits the most, with a 43.3% reduction in error. On Freebase, Bilinear benefits the most, with a 38.8% reduction in error.

In terms of mean quantile, the best overall model is TransE (Comp). In terms of hits at 10, the best model on WordNet is Bilinear (Comp), while the best model on Freebase is TransE (Comp).

Deduction and Induction. Table 3 takes a deeper look at performance on path query answering. We divided path queries into two subsets: deduction and induction. The deduction subset consists of queries $q = s/p$ where the source and target entities $\llbracket q \rrbracket$ are connected via relations $p$ in the training graph $\mathcal{G}_{\text{train}}$, but the specific query $q$ was never seen during training. Such queries can be answered by performing explicit traversal on the training graph, so this subset tests a model’s ability to approximate the underlying training graph and predict the existence of a path from a collection of single edges. The induction subset consists of all other queries. This means that at least one edge was missing on all paths following $p$ from source to target in the training graph. Hence, this subset tests a model’s generalization ability and its robustness to missing edges.

Performance on the deduction subset of the dataset is disappointingly low for models trained with single-edge training: they struggle to answer path queries even when all edges in the path query have been seen at training time. Compositional training dramatically reduces these errors, sometimes doubling mean quantile. In Section 6, we analyze how this might be possible. After compositional training, performance on the harder induction subset is also much stronger. Even when edges are missing along a path, the models are able to infer them.

Interpretable queries. Although our path datasets consist of random queries, both datasets contain a large number of useful, interpretable queries. Results on a few illustrative examples are shown in Table 4.




                            Bilinear               Bilinear-Diag                TransE
Path query task     Single   Comp  (%red)     Single   Comp  (%red)     Single   Comp  (%red)
WordNet   MQ          84.7   89.4    30.7       59.7   90.4    76.2       83.7   93.3    58.9
          H@10        43.6   54.3    19.0        7.9   31.1    25.4       13.8   43.5    34.5
Freebase  MQ          58.0   83.5    60.7       57.9   84.8    63.9       86.2   88      13.0
          H@10        25.9   42.1    21.9       23.1   38.6    20.2       45.4   50.5     9.3

KBC task            Single   Comp  (%red)     Single   Comp  (%red)     Single   Comp  (%red)
WordNet   MQ          76.1   82.0    24.7       76.5   84.3    33.2       75.5   86.1    43.3
          H@10        19.2   27.3    10.0       12.9   14.4    1.72        4.6   16.5    12.5
Freebase  MQ          85.3   91.0    38.8       84.6   89.1    29.2       92.7   92.8    1.37
          H@10        70.2   76.4    20.8       63.2   67.0    10.3       78.8   78.6    -0.9

Table 2: Path query answering and knowledge base completion. We compare the performance of single-edge training (Single) vs compositional training (Comp). MQ: mean quantile, H@10: hits at 10, %red: percentage reduction in error.
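The %red column is consistent with treating error as 100 minus the metric and reporting the relative drop in that error. This reading is inferred from the numbers in Table 2 rather than stated as a formula in the text, so the sketch below should be taken as an assumption; the function name `error_reduction` is hypothetical.

```python
def error_reduction(single, comp):
    """Percentage reduction in error, with error defined as 100 - metric."""
    error_single = 100.0 - single
    error_comp = 100.0 - comp
    return 100.0 * (error_single - error_comp) / error_single


# WordNet path queries, Bilinear, mean quantile: 84.7 -> 89.4
print(round(error_reduction(84.7, 89.4), 1))  # 30.7
# WordNet KBC, TransE, mean quantile: 75.5 -> 86.1
print(round(error_reduction(75.5, 86.1), 1))  # 43.3
```

A negative value, as in the Freebase KBC hits-at-10 entry for TransE, means the compositional model scored slightly lower than the single-edge model on that metric.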


Interpretable Queries                               Bilinear Single   Bilinear Comp
X/institution/institution^{-1}/profession                     50.0            93.6
X/parents/religion                                            81.9            97.1
X/nationality/nationality^{-1}/ethnicity^{-1}                 68.0            87.0
X/has_part/has_instance^{-1}                                  92.6            95.1
X/type_of/type_of/type_of                                     72.8            79.4

Table 4: Path query performance (mean quantile) on a selection of interpretable queries. We compare Bilinear Single and Bilinear Comp. Meanings of each query (descending): “What professions are there at X’s institution?”; “What is the religion of X’s parents?”; “What are the ethnicities of people from the same country as X?”; “What types of parts does X have?”; and the transitive “What is X a type of?”. (Note that a relation r and its inverse do not necessarily cancel out if r is not a one-to-one mapping. For example, X/institution/institution^{-1} denotes the set of all people who work at the institution X works at, which is not just X.)


Path query task          WordNet           Freebase
                       Ded.    Ind.      Ded.    Ind.
Bilinear   Single      96.9    66.0      49.3    49.4
           Comp        98.9    75.6      82.1    70.6
Bi-Diag    Single      56.3    51.6      49.3    50.2
           Comp        98.5    78.2      84.5    72.8
TransE     Single      92.6    71.7      85.3    72.4
           Comp        99.0    87.4      87.5    76.3

Table 3: Deduction and induction. We compare mean quantile performance of single-edge training (Single) vs compositional training (Comp). Length 1 queries are excluded.

Comparison with Socher et al. (2013). Here, we measure performance on the KBC task in terms of the accuracy metric of Socher et al. (2013). This evaluation involves sampled negatives, and is hence noisier than mean quantile, but makes our results directly comparable to Socher et al. (2013). Our results show that previously inferior models such as the bilinear model can outperform state-of-the-art models after compositional training.

Socher et al. (2013) proposed parametrizing each entity vector as the average of vectors of words in the entity ($w_{\text{tad\_lincoln}} = \frac{1}{2}(w_{\text{tad}} + w_{\text{lincoln}})$), and pretraining these word vectors using the method of Turian et al. (2010). Table 5 reports results when using this approach in conjunction with compositional training. We initialized all models with word vectors from Pennington et al. (2014). We found that compositionally trained models outperform the neural tensor network (NTN) on WordNet, while being only slightly behind on Freebase. (We did not use word vectors in any of our other experiments.)

When the strategy of averaging word vectors to form entity vectors is not applied, our compositional models are significantly better on WordNet and slightly better on Freebase. It is worth noting that in many domains, entity names are not lexically meaningful, so word vector averaging is not always meaningful.

Accuracy             WordNet          Freebase
                    EV      WV       EV      WV
NTN                70.6    86.2     87.2    90.0
Bilinear Comp      77.6    87.6     86.1    89.4
TransE Comp        80.3    84.9     87.6    89.6

Table 5: Model performance in terms of accuracy. EV: entity vectors are separate (initialized randomly); WV: entity vectors are average of word vectors (initialized with pretrained word vectors).

6 Analysis 

In this section, we try to understand why compositional training is effective. For concreteness, everything is described in terms of the bilinear model. We will refer to the compositionally trained model as Comp, and the model trained with single-edge training as Single.

6.1 Why does compositional training improve path query answering?

It is tempting to think that if Single has accurately modeled individual edges in a graph, it should accurately model the paths that result from those edges. This intuition turns out to be incorrect, as revealed by Single’s relatively weak performance on the path query dataset. We hypothesize that this is due to cascading errors along the path. For a given edge $(s, r, t)$ on the path, single-edge training encourages $x_t$ to be closer to $x_s^\top W_r$ than any other incorrect $x_{t'}$. However, once this is achieved by a margin of 1, it does not push $x_t$ any closer to $x_s^\top W_r$. The remaining discrepancy is noise which gets added at each step of path traversal. This is illustrated schematically in Figure 2.

To observe this phenomenon empirically, we examine how well a model handles each intermediate step of a path query. We can do this by measuring the reconstruction quality (RQ) of the set vector produced after each traversal operation. Since each intermediate stage is itself a valid path query, we define RQ to be the average quantile over all entities that belong in $\llbracket q \rrbracket$:

$$\mathrm{RQ}(q) = \frac{1}{|\llbracket q \rrbracket|} \sum_{t \in \llbracket q \rrbracket} \mathrm{quantile}(q, t) \tag{14}$$

When all entities in $\llbracket q \rrbracket$ are ranked above all incorrect entities, RQ is 1. In Figure 3, we illustrate how RQ changes over the course of a query.


[Figure 2 image: entity nodes tad_lincoln, abraham_lincoln, thomas_lincoln]

Figure 2: Cascading errors visualized for TransE. Each node represents the position of an entity in vector space. The relation parent is ideally a simple horizontal translation, but each traversal introduces noise. The red circle is where we expect Tad’s parent to be. The red square is where we expect Tad’s grandparent to be. Dotted red lines show that error grows larger as we traverse farther away from Tad. Compositional training pulls the entity vectors closer to the ideal arrangement.

Given the nature of cascading errors, it might seem reasonable to address the problem by adding a term to our objective which explicitly encourages $x_s^\top W_r$ to be as close as possible to $x_t$. With this motivation, we tried adding a $\lambda \|x_s^\top W_r - x_t\|_2^2$ term to the objective of the bilinear model and a $\lambda \|x_s + w_r - x_t\|_2^2$ term to the objective of TransE. We experimented with different settings of $\lambda$ over the range [0.001, 100]. In no case did this additional $\ell_2$ term improve Single’s performance on the path query or single edge dataset. These results suggest that compositional training is a more effective way to combat cascading errors.

6.2 Why does compositional training improve knowledge base completion?

Table 2 reveals that Comp also performs better on the single-edge task of knowledge base completion. This is somewhat surprising, since Single is trained on a training set which distributionally matches the test set, whereas Comp is not. However, Comp’s better performance on path queries suggests that there must be another factor at play. At a high level, training on paths must be providing some form of structural regularization which reduces cascading errors. Indeed, paths in a knowledge graph have proven to be important features for predicting the existence of single edges (Lao et al., 2011; Neelakantan et al., 2015). For example, consider the following Horn clause:

$$\text{parents}(x, y) \wedge \text{location}(y, z) \Rightarrow \text{place\_of\_birth}(x, z),$$

which states that if x has a parent with location z, then x has place of birth z. The body of the Horn clause expresses a path from x to z. If Comp models the path better, then it should be better able to use that knowledge to infer the head of the Horn clause.

Figure 3: Reconstruction quality (RQ) at each step of the query tad_lincoln/parents/place_of_birth/place_of_birth^{-1}/profession. Comp experiences significantly less degradation in RQ as path length increases. Correspondingly, the set of 5 highest scoring entities computed at each step using Comp (green) is significantly more accurate than the set given by Single (blue). Correct entities are bolded.

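More generally, consider Horn clauses of the form $p \Rightarrow r$, where $p = r_1/\ldots/r_k$ is a path type and $r$ is the relation being predicted. Let us focus on Horn clauses with high precision as defined by:

$$\mathrm{prec}(p) = \frac{|\llbracket p \rrbracket \cap \llbracket r \rrbracket|}{|\llbracket p \rrbracket|}, \tag{15}$$

where $\llbracket p \rrbracket$ is the set of entity pairs connected by $p$, and similarly for $\llbracket r \rrbracket$.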
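As an illustration of this precision measure, the sketch below computes the fraction of entity pairs connected by a path that are also connected by the predicted relation. The toy graph uses placeholder entity names, and `path_pairs` and `precision` are hypothetical helper names.

```python
def path_pairs(graph, path):
    """Set of (source, target) pairs connected by following `path` in `graph`."""
    entities = {e for s, _, t in graph for e in (s, t)}
    pairs = {(e, e) for e in entities}          # zero-length paths
    for relation in path:
        pairs = {(src, t)
                 for (src, mid) in pairs
                 for (s, r, t) in graph
                 if s == mid and r == relation}
    return pairs


def precision(graph, path, relation):
    """Share of pairs connected by `path` that `relation` also connects."""
    body = path_pairs(graph, path)
    head = path_pairs(graph, [relation])
    return len(body & head) / len(body) if body else 0.0


# Placeholder graph: two children share a parent, but only one has a birthplace edge.
toy_graph = {
    ("child_a", "parents", "parent"),
    ("child_b", "parents", "parent"),
    ("parent", "location", "city"),
    ("child_a", "place_of_birth", "city"),
}

prec = precision(toy_graph, ["parents", "location"], "place_of_birth")
print(prec)  # 0.5
```

Here the path parents/location connects two pairs, and place_of_birth holds for one of them, so the clause has precision 0.5 on this toy graph.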

Intuitively, one way for the model to implicitly learn and exploit such a Horn clause would be to satisfy the following two criteria:

1. The model should ensure a consistent spatial relationship between entity pairs that are related by the path type $p$; that is, keeping $x_s^\top W_{r_1} \cdots W_{r_k}$ close to $x_t$ for all valid $(s, t)$ pairs.

2. The model’s representation of the path type $p$ and relation $r$ should capture that spatial relationship; that is, $x_s^\top W_{r_1} \cdots W_{r_k} \approx x_t$ implies $x_s^\top W_r \approx x_t$, or simply $W_{r_1} \cdots W_{r_k} \approx W_r$.


We have already seen empirically that Single does not meet criterion 1, because cascading errors cause it to put incorrect entity vectors $x_{t'}$ closer to $x_s^\top W_{r_1} \cdots W_{r_k}$ than the correct entity. Comp mitigates these errors.

To empirically verify that Comp also does a better job of meeting criterion 2, we perform the following: for a path type $p$ and relation $r$, define dist($p$, $r$) to be the angle between their corresponding matrices (treated as vectors in $\mathbb{R}^{d^2}$). This is a natural measure because $x_s^\top W_r x_t$ computes the matrix inner product between $W_r$ and $x_s x_t^\top$. Hence, any matrix with small distance from $W_r$ will produce nearly the same scores as $W_r$ for the same entity pairs.
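A minimal sketch of this distance follows, assuming that the matrix of a path type is the product of its relation matrices as in (6). The 2x2 matrices are hand-picked for illustration, and `matmul` and `matrix_angle` are hypothetical helper names.

```python
import math


def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matrix_angle(a, b):
    """Angle in radians between two matrices flattened into vectors."""
    u = [x for row in a for x in row]
    v = [x for row in b for x in row]
    inner = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    cosine = max(-1.0, min(1.0, inner / (norm_u * norm_v)))  # guard rounding
    return math.acos(cosine)


swap = [[0.0, 1.0], [1.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

path_matrix = matmul(swap, swap)              # applying swap twice gives identity
aligned = matrix_angle(path_matrix, identity)
orthogonal = matrix_angle(path_matrix, swap)
```

In this toy case the two-step path matrix coincides with `identity`, so the angle to it is essentially zero, while the angle to `swap` is a right angle because the flattened matrices have zero inner product.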

If Comp is better at capturing the correlation between $p$ and $r$, then we would expect that when prec($p$) is high, compositional training should shrink dist($p$, $r$) more. To confirm this hypothesis, we enumerated over all 676 possible paths of length 2 (including inverted relations), and examined the proportional reduction in dist($p$, $r$) caused by compositional training,

$$\Delta\mathrm{dist}(p, r) = \frac{\mathrm{dist}_{\text{COMP}}(p, r) - \mathrm{dist}_{\text{SINGLE}}(p, r)}{\mathrm{dist}_{\text{SINGLE}}(p, r)}. \tag{16}$$

Figure 4 shows that higher precision paths indeed correspond to larger reductions in dist($p$, $r$).

7 Related work 

Knowledge base completion with vector space models. Many models have been proposed for knowledge base completion, including those reviewed in Section 3.4 (Nickel et al., 2011; Bordes et al., 2013; Yang et al., 2015; Socher et al., 2013). Dong et al. (2014) demonstrated that KBC models can improve the quality of relation extraction by serving as graph-based priors. Riedel et al. (2013) showed that such models can also be directly used for open-domain relation extraction. Our compositional training technique is an orthogonal improvement that could help any composable model.

Distributional compositional semantics. Previous works have explored compositional vector space representations in the context of logic and sentence interpretation. In Socher et al. (2012), a matrix is associated with each word of a sentence, and can be used to recursively modify the meaning of nearby constituents. Grefenstette (2013) explored the ability of tensors to simulate logical calculi. Bowman et al. (2014) showed that recursive neural networks can learn to distinguish important semantic relations. Socher et al. (2014) found that compositional models were powerful enough to describe and retrieve images.

Figure 4: We divide paths of length 2 into high precision (> 0.3), low precision (< 0.3), and not co-occurring with r. Here r = nationality. Each box plot shows the min, max, and first and third quartiles of $\Delta\mathrm{dist}(p, r)$. As hypothesized, compositional training results in large decreases in dist(p, r) for high precision paths p, modest decreases for low precision paths, and little to no decreases for irrelevant paths.

We demonstrate that compositional representations are also useful in the context of knowledge base querying and completion. In the aforementioned work, compositional models produce vectors which represent truth values, sentiment or image features. In our approach, vectors represent sets of entities constituting the denotation of a knowledge base query.

Path modeling. Numerous methods have been proposed to leverage path information for knowledge base completion and question answering. Nickel et al. (2014) proposed combining low-rank models with sparse path features. Lao and Cohen (2010) used random walks as features and Gardner et al. (2014) extended this approach by using vector space similarity to govern random walk probabilities. Neelakantan et al. (2015) addressed the problem of path sparsity by embedding paths using a recurrent neural network. Perozzi et al. (2014) sampled random walks on social networks as training examples, with a different goal: classifying nodes in the network. Bordes et al. (2014) embed paths as a sum of relation vectors for question answering. Our approach is unique in modeling the denotation of each intermediate step of a path query, and using this information to regularize the spatial arrangement of entity vectors.

8 Discussion 

We introduced the task of answering path queries on an incomplete knowledge base, and presented a general technique for compositionalizing a broad class of vector space models. Our experiments show that compositional training leads to state-of-the-art performance on both path query answering and knowledge base completion.

There are several key ideas from this paper: regularization by augmenting the dataset with paths, representing sets as low-dimensional vectors in a context-sensitive way, and performing function composition using vectors. We believe these three could all have greater applicability in the development of vector space models for knowledge representation and inference.

Reproducibility. Our code, data, and experiments are available on the CodaLab platform at https://www.codalab.org/worksheets/0xfcace41fdeec45f3bc6ddf31107b829f.